from IPython.display import Image
The base data set from which we start has approximately 35 features. The data set was cleaned and pre-processed for analysis, as outlined in the prior sections of this report: missing values were identified, outliers dispositioned, and all real-valued features re-scaled to a standard normal distribution.
In the nominal data set, some features are naturally scaled from 0 to 1 (real-valued), such as the Latent Dirichlet Allocation (LDA) measures, while others range from 0 to roughly 800,000 (e.g., the number of shares in a social-media context). Since both dimensionality reduction and cluster analysis depend on relative magnitudes, all real-valued features were mapped to a standard normal distribution so that every feature carries even weight in the mapping and clustering processes. The binary features (e.g., is_data_channel_technology) were retained as 0/1 values and one-hot encoded so that distance evaluations among such categorical features are similarly evenly weighted.
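The re-scaling step described above can be sketched as follows. This is a minimal illustration, not the report's actual pipeline: the feature names and value ranges are stand-ins chosen to mimic the LDA measures and the share counts mentioned above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical stand-ins for two real-valued features on very different
# scales: an LDA topic proportion in [0, 1] and a share count up to ~800,000.
lda_measure = rng.uniform(0.0, 1.0, size=(1000, 1))
n_shares = rng.uniform(0.0, 800_000.0, size=(1000, 1))
X = np.hstack([lda_measure, n_shares])

# Map each real-valued feature to zero mean and unit variance so that
# distance-based methods weight the features evenly.
X_std = StandardScaler().fit_transform(X)
print(X_std.mean(axis=0).round(6), X_std.std(axis=0).round(6))
```

Binary features would bypass this scaler and remain as 0/1 columns, as described above.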
Early efforts to apply cluster analyses directly to the cleaned data set yielded results that did not lend themselves to straightforward interpretation. Visually, the cluster maps did not present well-organized clusters, and the silhouette and distortion metrics varied erratically with the number of clusters: they were not smooth functions and gave no clear indication of an optimal, or even preferred, number of clusters. Methods attempted at that point included k-means, DBSCAN, and Spectral Clustering.
Thus, we were motivated to explore dimensionality reduction as a means of simplifying the data set presented to the clustering algorithms. Evaluating choices for dimensionality reduction, we considered Principal Components Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE), and between the two we decided to evaluate t-SNE.
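A t-SNE mapping of this kind can be sketched with scikit-learn. This is an illustrative sketch, not the report's actual run: the data here are random stand-ins with the same ~35-feature width, and the perplexity value is only one point from the range the figures below sweep.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 35))  # stand-in for the ~35 standardized features

# Map the feature space down to 2-D; perplexity is the key tuning knob
# (the figures below examine perplexity values such as 10, 100, and 200).
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
X_2d = tsne.fit_transform(X)
print(X_2d.shape)
```

The KL divergence after fitting (`tsne.kl_divergence_`) and the fit time are the quantities compared in the divergence/process-time figure below.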
Image("../cluster/t_sne_divergence_process_time.png")
Image("../cluster/t-sne_perplx_plots/perplex_0010.png")
Image("../cluster/t-sne_perplx_plots/perplex_0100.png", retina = True)
Image("../cluster/t-sne_perplx_plots/perplex_0200.png")
Having completed the t-SNE mapping, the next step in the process was to apply different clustering methods and evaluate the appropriateness of the resulting clusters.
In our evaluation, we chose three methods: k-means, DBSCAN, and Spectral Clustering.
These methods differ fundamentally in their underlying assumptions, so each offers a distinct opportunity to achieve at least some success on this data set.
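The three methods can be instantiated side by side as sketched below. The parameter values (`n_clusters`, `eps`, `min_samples`) are placeholders for illustration, not the values tuned in this analysis, and the input is a random stand-in for the 2-D t-SNE embedding.

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN, SpectralClustering

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))  # stand-in for the 2-D t-SNE embedding

# Illustrative parameters only; each method partitions the same points
# under different assumptions (centroids, density, graph affinity).
labels = {
    "kmeans": KMeans(n_clusters=7, n_init=10, random_state=0).fit_predict(X),
    "dbscan": DBSCAN(eps=0.5, min_samples=5).fit_predict(X),
    "spectral": SpectralClustering(n_clusters=7, random_state=0).fit_predict(X),
}
for name, lab in labels.items():
    # DBSCAN labels noise points -1; exclude them from the cluster count.
    print(name, len(set(lab) - {-1}), "clusters")
```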
#### K-Means Clustering
Thus, by standard measures, appropriate choices for the number of clusters from this k-means clustering analysis are 12 or 13, and it is also reasonable to evaluate the clusters with k = 7.
Image("../cluster/cluster_kmeans_number_of_clusters_eval.png")
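The k-selection sweep behind a figure like the one above can be sketched as follows: fit k-means for a range of k and record the distortion (inertia) and silhouette score for each. The data here are random stand-ins, so the printed values are illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))  # stand-in for the 2-D t-SNE embedding

for k in range(2, 15):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    distortion = km.inertia_            # within-cluster sum of squares
    silhouette = silhouette_score(X, km.labels_)
    print(k, round(distortion, 1), round(silhouette, 3))
```

An "elbow" in the distortion curve and local maxima in the silhouette curve are the standard measures referenced above for choosing candidate values of k.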
### K-Means Clusters Evaluation
To evaluate the resulting clusters for each of the above candidate values of k (7, 12, 13), the following approach is taken:
construct a visual interpretation aid of a 3-plot set for each feature, as shown in the figures below. Each 3-plot set includes the following:
These plots show the following relationships:
Image("../cluster/cluster_spctrl_3way_preplx_100_ln_LDA_00.png")
Image("../cluster/cluster_spctrl_3way_preplx_100_ln_LDA_01.png")
Image("../cluster/cluster_spctrl_3way_preplx_100_ln_LDA_02.png")
Image("../cluster/cluster_spctrl_3way_preplx_100_ln_LDA_03.png")
Image("../cluster/cluster_spctrl_3way_preplx_100_ln_LDA_04.png")
To further understand the cluster relationships, an additional view is presented. For each cluster, the mean value of each feature was determined, along with the standard deviation of those means and a z-score of each mean relative to the other means in that cluster. The goal was not to test these z-scores for statistically significant differences in means, but rather to identify, in a consistent way, the relative participation of each feature in each cluster: that is, the few most impactful features (both positively and negatively) in defining the cluster characteristics. The median of each cluster, or some other statistic, could also have been used for this purpose; for the clusters developed from this data set, means and medians provide essentially the same view of the major contributors to a cluster's characterization.
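One plausible reading of this computation can be sketched with pandas: take the per-cluster feature means, then z-score each mean against the other feature means within the same cluster. The data, feature names, and number of clusters below are hypothetical stand-ins, and the within-cluster axis choice is an assumption based on the description above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical data: 500 observations, 5 standardized features, 4 clusters.
df = pd.DataFrame(rng.normal(size=(500, 5)),
                  columns=[f"feat_{i}" for i in range(5)])
df["cluster"] = rng.integers(0, 4, size=500)

# Mean of each feature within each cluster ...
cluster_means = df.groupby("cluster").mean()
# ... then z-score each feature's mean against the other feature means in
# the same cluster; large |z| flags the features that most define a cluster.
z = (cluster_means
     .sub(cluster_means.mean(axis=1), axis=0)
     .div(cluster_means.std(axis=1, ddof=0), axis=0))
print(z.round(2))
```

Sorting each row of `z` by absolute value surfaces the few most impactful features per cluster, which is the view the bar plots below present.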
The plots below show these distributions of means for each feature in each of the clusters developed from the above k-means application.
As an example, we can make the following observations of the clusters based on these plots (and a detailed examination of the underlying values in a data table):
Cluster 01
Cluster 04
Similarly, an interpretation was completed for each cluster based on the relative distribution of feature means within each cluster.
As will be shown in subsequent sections, this exercise was repeated for each of the clustering methods deployed. A synopsis of relevant clusters from the overall analysis will be presented in the summary section.
Image("../cluster/cluster_kmeans_cluster_barplots.png")